[AUTOGENERATED] develop_IFU_20260116 #2911

Closed
pragupta wants to merge 1985 commits into develop from develop_IFU_20260116

Conversation

@pragupta
Collaborator

rocm_base: f742da3

pytorchmergebot and others added 30 commits January 12, 2026 06:41
This reverts commit 0f9c766.

Reverted pytorch#170531 on behalf of https://github.com/jeanschmidt due to seems to break internal signals, see D90440132 ([comment](pytorch#170531 (comment)))
This reverts commit c6583cb.

Reverted pytorch#169551 on behalf of https://github.com/jeanschmidt due to seems to be breaking internal signals, see D90448078 ([comment](pytorch#169551 (comment)))
…ch#169550)"

This reverts commit 649d9b3.

Reverted pytorch#169550 on behalf of https://github.com/jeanschmidt due to seems to be breaking internal signals, see D90448078 ([comment](pytorch#169550 (comment)))
…ard-compatible (pytorch#172176)"

This reverts commit 084f69f.

Reverted pytorch#172176 on behalf of https://github.com/jeanschmidt due to sorry, need to revert in order to revert pytorch#169549, please feel free to re-merge this change once rebased ([comment](pytorch#172176 (comment)))
…m_mesh (pytorch#169549)"

This reverts commit 75ce1e9.

Reverted pytorch#169549 on behalf of https://github.com/jeanschmidt due to seems to be breaking internal signals, see D90448078 ([comment](pytorch#169549 (comment)))
…ytorch#169549)

For compile on one rank we need to be able to compute the _coordinate_on_dim based on the Tensor and rank (because we don't have the DeviceMesh in the graph). So this PR factors out the _coordinate_on_dim logic into its own function.

Also defined a `RankType` alias to represent rank-like types (union of `int` and `SymInt`).

Another issue is that _compute_local_shape_and_global_offset() is passed an entire coordinate array but really only needs a single value, so it was changed to take a lambda that returns the desired coordinate. Often this is just DeviceMesh.sym_get_coordinate (from the previous PR).

Pull Request resolved: pytorch#169549
Approved by: https://github.com/ezyang
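The factored-out coordinate computation can be sketched outside PyTorch. This is a hypothetical stand-in, assuming the coordinate along a mesh dimension is the standard row-major unraveling of the flat rank (`coordinate_on_dim` and the stubbed `SymInt` name are illustrative, not the PR's actual code):

```python
from typing import Union

# "RankType" mirrors the alias described above; SymInt is only name-referenced
# here since this sketch runs outside PyTorch.
RankType = Union[int, "SymInt"]

def coordinate_on_dim(rank: int, mesh_shape: tuple, dim: int) -> int:
    # Row-major unraveling of a flat rank into its coordinate along one mesh
    # dim, computable from the rank alone (no DeviceMesh object in the graph).
    stride = 1
    for s in mesh_shape[dim + 1:]:
        stride *= s
    return (rank // stride) % mesh_shape[dim]
```

For a 2x3 mesh, rank 5 sits at coordinate (1, 2), which the function recovers per dimension.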
For compile on one rank we need to be able to compute the DeviceMesh rank Tensor based on the raw Tensor and current rank. So this PR factors out `DeviceMesh._get_mesh_tensor_from_full_mesh()` into a static method.

Pull Request resolved: pytorch#169550
Approved by: https://github.com/ezyang
ghstack dependencies: pytorch#169549
`Placement._split_tensor()` computes and returns too much information - in general most callers call it and then throw away most of the results. Added `Placement._select_split_tensor()` which allows the caller to say which parts they want so we can compute only those bits - in essence it is the combination of `Placement._split_tensor()` and `Shard._select_shard()`.

Pull Request resolved: pytorch#169551
Approved by: https://github.com/ezyang
ghstack dependencies: pytorch#169549, pytorch#169550
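The "compute only what the caller asked for" idea can be illustrated with a toy uneven split. This is a hypothetical sketch (function name and keyword flags invented for illustration), not the actual `Placement._select_split_tensor()` signature:

```python
def select_split_tensor(length, num_chunks, rank, *, want_size=False, want_offset=False):
    # Uneven Shard-style split: the first `rem` chunks get one extra element.
    # Only the requested pieces are computed, instead of everything at once.
    base, rem = divmod(length, num_chunks)
    out = {}
    if want_size:
        out["size"] = base + (1 if rank < rem else 0)
    if want_offset:
        out["offset"] = rank * base + min(rank, rem)
    return out
```

A caller that only needs the local size pays nothing for the offset computation, which is the efficiency argument the PR makes.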
…atible (pytorch#172176)

Fix for pytorch#169549: An internal user was calling `_compute_local_shape_and_global_offset()` directly so I figured it was safer to make the API backward compatible.

Pull Request resolved: pytorch#172176
Approved by: https://github.com/bobrenjc93
This reverts commit 64edaa7.

Reverted pytorch#170645 on behalf of https://github.com/jeanschmidt due to _Snapshot type definition does not contain all fields returned by _cuda_memorySnapshot, we probably want to include all to avoid breaking more strict type checks, like internal ones ([comment](pytorch#170186 (comment)))
…ytorch#170186)"

This reverts commit 81bc0ab.

Reverted pytorch#170186 on behalf of https://github.com/jeanschmidt due to _Snapshot type definition does not contain all fields returned by _cuda_memorySnapshot, we probably want to include all to avoid breaking more strict type checks, like internal ones ([comment](pytorch#170186 (comment)))
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: pytorch#172023
Approved by: https://github.com/Lucaskabela
In this PR, we add support for tracing through autograd.grad. Summary of changes:

- Add trace_autograd_ops config (default False)
- Add handler for autograd.grad that checks for external grad_fn inputs
- Add validation for compile inputs/outputs:
   - Graph break when autograd.grad inputs are reachable from any compile inputs that have an external grad_fn, because torch.compile won't understand relations that live outside the compiled region
   - Graph break when compile outputs are reachable from autograd.grad outputs, as we would otherwise run into an awkward double-backward error.
- Add test_fwd_loss_bwd.py with tests for the new functionality
- Graph break on GradientEdge.

Pull Request resolved: pytorch#169493
Approved by: https://github.com/ezyang
Following pytorch#167729, one natural extension is that we should be able to trace through tensor.backward. When we do tensor.backward, we rewrite it in dynamo as autograd.grad + torch.ops.inductor.accumulate_grad_. When no explicit inputs are passed to backward, we just look at the dynamo graph inputs and find all input parameters. This works because autograd.grad simply returns None for the unused ones.

The change looks quite big but there are several small/independent fixes:
1. We have to pass the allow_unused parameter to autograd.grad when we are automatically querying parameters, because we pass in all possible parameters, which may or may not correspond to the particular autograd.grad call
2. We try to collect in-graph created parameters
3. We use node proxies to deduplicate parameters that might be aliasing each other
4. If user passes in non-leaf parameter to backward, we graph break because generic .grad handling in dynamo is limited today. Will do as a follow up.
5. If we are calling backward on a tensor with no grad fn, we should error.

Pull Request resolved: pytorch#169415
Approved by: https://github.com/ezyang
ghstack dependencies: pytorch#169493
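The rewrite described above can be sketched with a toy, PyTorch-free analogue. Here `grad_of` stands in for `autograd.grad(..., allow_unused=True)` and the `+=` stands in for `torch.ops.inductor.accumulate_grad_`; the dict-based "parameters" are purely illustrative:

```python
def grad_of(params):
    # allow_unused=True semantics: parameters not used by the loss yield None.
    return [p.get("contrib") for p in params]

def backward_rewrite(params):
    # tensor.backward() rewritten as: compute grads for all candidate
    # parameters, then accumulate into .grad, skipping the unused ones.
    for p, g in zip(params, grad_of(params)):
        if g is None:
            continue
        p["grad"] = p.get("grad", 0.0) + g
```

Calling it twice accumulates, mirroring how repeated backward calls sum into `.grad`, while the unused parameter is left untouched.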
This reverts commit f2c5f9f.

Reverted pytorch#169415 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#169415 (comment)))
Accelerate dataclasses using the Python 3.10 dataclass slots kwarg

Pull Request resolved: pytorch#172172
Approved by: https://github.com/XuehaiPan, https://github.com/Lucaskabela
…y default for non-FBCODE (pytorch#168175)

pytorch#153556 claimed to do this but backed out of doing it in the end, retrying...

CC @nikitaved

Pull Request resolved: pytorch#168175
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/malfet
Implements Cholesky inverse.
Pull Request resolved: pytorch#172186
Approved by: https://github.com/malfet
So far a no-op, will be used later to gate Metal-4 capabilities
Pull Request resolved: pytorch#172229
Approved by: https://github.com/Skylion007
…72247)

Reduce noise. The original intention was to flag functions that do not have annotations, but given that we now disallow overloads in torchlib, this warning has become irrelevant.

Pull Request resolved: pytorch#172247
Approved by: https://github.com/titaiwangms
)

We need sharding for this job, but this is a quick mitigation for https://github.com/pytorch/pytorch/actions/runs/20633756042/job/59256709062.  Let's see if 12 hours is enough to finish the job.

Pull Request resolved: pytorch#172158
Approved by: https://github.com/jainapurva
Fixes pytorch#144836

Summary:

- Fixes `MaxUnpool` crash when input tensors are small.

Pull Request resolved: pytorch#169359
Approved by: https://github.com/cyyever, https://github.com/isuruf
To truly get cross-compilation, we need to be able to initialize a CUDA device mesh without an actual device. This PR allows us to do that.

Pull Request resolved: pytorch#171830
Approved by: https://github.com/aorenste
Mark more dataclasses frozen as slots only.

Pull Request resolved: pytorch#172173
Approved by: https://github.com/aorenste
malfet and others added 23 commits January 16, 2026 15:21
Remove 7.0 from TORCH_CUDA_ARCH_LIST but add Hopper there

Partially addresses pytorch#172351
Pull Request resolved: pytorch#172598
Approved by: https://github.com/atalman
Previously we missed the edge case where we create fake buffers/parameters before compiling_context kicks in. As a result, our fake tensor instrumentation didn't work, since it needs to know whether we are under "is_exporting".
Also hardened the warning message a bit more for when there is no available stack trace (in the test case I added, the buffer is an input/placeholder, and placeholders don't have a stack trace, so we try to find nodes that reference the buffer).
Pull Request resolved: pytorch#172597
Approved by: https://github.com/angelayi
…d symbolic_hint (pytorch#172429)"

This reverts commit 19d8ad0.

Reverted pytorch#172429 on behalf of https://github.com/yangw-dev due to sorry but it seems like this breaks internal test, error: AttributeError: 'ShapeEnv' object has no attribute 'var_to_val'. Did you mean: 'val_to_var'? ([comment](pytorch#172405 (comment)))
This reverts commit 56f2281.

Reverted pytorch#172405 on behalf of https://github.com/yangw-dev due to sorry but it seems like this breaks internal test, error: AttributeError: 'ShapeEnv' object has no attribute 'var_to_val'. Did you mean: 'val_to_var'? ([comment](pytorch#172405 (comment)))
Fix windows-arm64-build-test build error.
https://github.com/pytorch/pytorch/actions/workflows/win-arm64-build-test.yml
https://github.com/pytorch/pytorch/actions/runs/20908075245/job/60065482014
```
Configuring libuv...
-- Building for: Visual Studio 17 2022
CMake Error at CMakeLists.txt:1 (cmake_minimum_required):
  Compatibility with CMake < 3.5 has been removed from CMake.
```

CMake 4.0 doesn't support versions older than 3.5:
`Changed in version 4.0: Compatibility with versions of CMake older than 3.5 is removed.`
https://cmake.org/cmake/help/latest/command/cmake_minimum_required.html

libuv 1.39 uses CMake 3.4, which is not supported by CMake 4.x:
`cmake_minimum_required(VERSION 3.4)`

Update the libuv version to one that uses a newer CMake:
https://github.com/libuv/libuv/blob/v1.47.0/CMakeLists.txt
`cmake_minimum_required(VERSION 3.9)`

Pull Request resolved: pytorch#172203
Approved by: https://github.com/malfet
This reverts commit d05e0f0.

Reverted pytorch#166492 on behalf of https://github.com/yangw-dev due to a previous commit introducing a new test file that breaks internal systems; please rebase after pytorch#169121's revert and reland ([comment](pytorch#166492 (comment)))
…for pytorch#172231 (pytorch#172340)"

This reverts commit d803917.

Reverted pytorch#172340 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#172340 (comment)))
…nel launcher for XPU backend. (pytorch#169938)"

This reverts commit 89a1835.

Reverted pytorch#169938 on behalf of https://github.com/yangw-dev due to the bottom of the stack breaking internal systems and causing a SEV; please reach out to internal staff for more details ([comment](pytorch#169938 (comment)))
…cpp module. (pytorch#168952)"

This reverts commit 96c7aa5.

Reverted pytorch#168952 on behalf of https://github.com/yangw-dev due to breaking internal systems and causing a SEV; please reach out to internal staff for more details ([comment](pytorch#169121 (comment)))
…static triton launcher. (pytorch#169121)"

This reverts commit 84646f7.

Reverted pytorch#169121 on behalf of https://github.com/yangw-dev due to breaking internal systems and causing a SEV; please reach out to internal staff for more details ([comment](pytorch#169121 (comment)))
…ytorch#167456)"

This reverts commit 3b0e35b.

Reverted pytorch#167456 on behalf of https://github.com/yangw-dev due to a previous commit having issues landing internally; please rebase and try again ([comment](pytorch#167456 (comment)))
This reverts commit 48b5311.

Reverted pytorch#167696 on behalf of https://github.com/yangw-dev due to sorry, the imported diff has merge conflict internally, please fix it and reland ([comment](pytorch#167696 (comment)))
Two issues are fixed here:

- skip operator aten.record_stream.default, otherwise correct uses of tensor.record_stream(stream) are flagged.
- if a Tensor's data_ptr() is 0, track its id() instead to avoid errors in fbgemm using 0-element Tensors as placeholders

This PR replaces pytorch#168313 that was blocked by infra issue of pytorchbot PAT and ROCm fork.
Pull Request resolved: pytorch#172562
Approved by: https://github.com/Mellonta
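The second fix above boils down to a fallback keying rule, sketched here as a hypothetical helper (name invented for illustration): tensors whose `data_ptr()` is 0, such as fbgemm's 0-element placeholders, are tracked by Python `id()` so distinct placeholders don't all collide on the null pointer.

```python
def tracking_key(data_ptr: int, tensor_obj) -> int:
    # Normal tensors are keyed by their storage pointer; 0-element placeholder
    # tensors (data_ptr() == 0) fall back to the object's identity.
    return data_ptr if data_ptr != 0 else id(tensor_obj)
```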
…hen segments are few but large (pytorch#172454)

Summary:
The current 2-step sum-and-scatter approach is not efficient when the number of segments is small but each segment is large. For example, we observed in our prod job a sum_and_scatter kernel taking 40ms whose launch grid size is 100 on a Blackwell GPU (148 SMs): a third of the SMs are idle while the others are overloaded, and the kernel is bottlenecked by the largest segment.
To improve efficiency in this case, we add a fused atomic-add kernel with grid size proportional to max_partial_segment instead of max_segments, giving much better parallelism when segments are few but large.

Test Plan:
Added a unit test to cover the atomic accumulate kernel path. The large atol and rtol thresholds in the tests are needed for both the original path and the new atomic accumulate path.
- AMD MI300: https://www.internalfb.com/intern/testinfra/testrun/12666374093382243

- Verified numerical parity and perf improvement with e2e workflow.

Baseline: fire-renqincai-20251210-2133-a14d8be3, sum_and_scatter kernel takes about 40ms per iteration.
After optimization: fire-yjyao-20251214-1731-faeab2e0, reduced to about 6ms.
4% qps gain, NE on par

Differential Revision: D90689388

Pull Request resolved: pytorch#172454
Approved by: https://github.com/yangw-dev
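The launch-grid change above can be sketched with back-of-envelope arithmetic. Names and the block size are hypothetical, not the kernel's actual parameters; the point is that parallelism scales with segment length rather than segment count, so few-but-large segments no longer leave SMs idle:

```python
def launch_grid(num_segments, max_segment_len, block=1024):
    # Old scheme: one program per segment -> grid == num_segments,
    # so 100 segments leave a third of a 148-SM GPU idle.
    old = num_segments
    # New scheme: split the largest segment into block-sized partials and
    # combine them with atomic adds -> grid grows with segment length.
    partials = (max_segment_len + block - 1) // block
    new = num_segments * partials
    return old, new
```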
It used to execute the masked body even if the mask was false, which led to out-of-bounds accesses.
Before this change, compiling `torch.ops.aten._unsafe_masked_index` generated the following Metal kernel:
```metal
kernel void generated_kernel(
    device float* out_ptr0,
    constant bool* in_ptr0,
    constant long* in_ptr1,
    constant float* in_ptr2,
    uint xindex [[thread_position_in_grid]]
) {
    int x0 = xindex;
    auto tmp0 = in_ptr0[x0]; // mask
    auto tmp1 = in_ptr1[x0]; // index
    if ((tmp1 < 0) && (tmp1 > 8)) return; // check_bounds, regardless of mask
    auto tmp3 = in_ptr2[tmp1];  // load
    auto tmp4 = tmp0 ? tmp3 : 1;
    out_ptr0[x0] = static_cast<float>(tmp4);
}
```
After
```metal
kernel void generated_kernel(
    device float* out_ptr0,
    constant bool* in_ptr0,
    constant long* in_ptr1,
    constant float* in_ptr2,
    uint xindex [[thread_position_in_grid]]
) {
    int x0 = xindex;
    auto tmp0 = in_ptr0[x0];
    float tmp1;
    if (tmp0) {
        auto tmp_scoped_0 = in_ptr1[x0];
        if ((tmp_scoped_0 < 0 || tmp_scoped_0 >= 8)) return;
        auto tmp_scoped_2 = in_ptr2[tmp_scoped_0];
        tmp1 = tmp_scoped_2;
    } else tmp1 = 1;
    out_ptr0[x0] = static_cast<float>(tmp1);
}
```

When emitting the `body` part of the graph, reset the scoped CSE variable counter and changed the variable prefix to `tmp_scoped_`.
This fixes `test_reflection_pad2d_backward` and `test_vectorized_ops_masked_var_novec`  for MPS

TODO: When MacOS-14 is deprecated, replace this with a lambda, a Metal feature in MacOS-15, though I don't think they are substantially better than `if {} else {}` blocks.

Pull Request resolved: pytorch#170134
Approved by: https://github.com/dcci, https://github.com/jansel
If one attempts to import `torch.utils.flop_counter` on MacOS (or any other CPU-only system), they'll be greeted with the following repeated warning:
```
% python -c "import torch.utils.flop_counter"
W0115 18:52:55.246000 34027 torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0115 18:52:55.246000 34027 torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0115 18:52:55.246000 34027 torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0115 18:52:55.247000 34027 torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0115 18:52:55.247000 34027 torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0115 18:52:55.247000 34027 torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0115 18:52:55.247000 34027 torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0115 18:52:55.247000 34027 torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0115 18:52:55.247000 34027 torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0115 18:52:55.247000 34027 torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0115 18:52:55.247000 34027 torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0115 18:52:55.247000 34027 torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0115 18:52:55.247000 34027 torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
```
which was introduced by pytorch#169876

Move the import to the top of the flop_counter utils and only warn if the system is compiled with GPU support that requires triton (i.e. CUDA/ROCm or XPU).

Pull Request resolved: pytorch#172614
Approved by: https://github.com/Lucaskabela
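The gating logic described above amounts to: warn only on builds that actually expect triton, and at most once. A hypothetical standalone sketch (function and flag names invented, not the actual flop_counter code):

```python
import warnings

def maybe_warn_no_triton(has_triton: bool, gpu_build: bool, _warned=[False]):
    # Warn only when the build expects triton (CUDA/ROCm/XPU), and only once
    # per process instead of on every kernel-formula registration.
    if gpu_build and not has_triton and not _warned[0]:
        _warned[0] = True
        warnings.warn("triton not found; flop counting will not work for triton kernels")
```

On a CPU-only build (`gpu_build=False`) the function stays silent no matter how often it is called.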
…9140 (pytorch#164069)

Fixes pytorch#139140

Pull Request resolved: pytorch#164069
Approved by: https://github.com/isuruf

Co-authored-by: Isuru Fernando <isuruf@gmail.com>
…kernel when segments are few but large (pytorch#172454)"

This reverts commit 02513f5.

Reverted pytorch#172454 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](pytorch#172454 (comment)))
Resolves pytorch#172050

Two motivations:
- Give better UX and perf to users who explicitly use `symm_mem.empty()`.
- Simplify the code generated by Inductor, i.e. `symm_mem.empty()` would automatically reuse memory, rather than requiring Inductor to bookkeep it.

The MemPool infra for all CUDA backends (`CUDA`, `NVSHMEM`, `NCCL`) has been built previously.
Pull Request resolved: pytorch#172292
Approved by: https://github.com/ngimel, https://github.com/dzmitry-huba
ghstack dependencies: pytorch#172163
Requested by an internal user.

Pull Request resolved: pytorch#172461
Approved by: https://github.com/v0i0
# Conflicts:
#	.ci/docker/ci_commit_pins/triton.txt
#	.ci/docker/requirements-ci.txt
#	.ci/docker/triton_version.txt
#	.circleci/scripts/binary_populate_env.sh
#	.github/scripts/build_triton_wheel.py
#	test/test_sparse_csr.py
@rocm-repo-management-api

rocm-repo-management-api bot commented Jan 16, 2026

Jenkins build for 15aad50f71f23865cecf564138e10fd0c750d64c commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@pragupta
Collaborator Author

Closing in favor of : #2915

@pragupta pragupta closed this Jan 19, 2026